Distributed Stochastic Gradient Descent with Compressed and Skipped Communication
Authors
Abstract
This paper introduces CompSkipDSGD, a new algorithm for distributed stochastic gradient descent that aims to improve communication efficiency by compressing and selectively skipping communication. In addition to compression, CompSkipDSGD allows both the workers and the server to skip communication in any iteration of the training process and reserve it for future iterations without significantly decreasing testing accuracy. Our experimental results on the large-scale ImageNet dataset demonstrate that CompSkipDSGD can save hundreds of gigabytes of communication while maintaining similar levels of accuracy compared to state-of-the-art algorithms. The experimental results are supported by a theoretical analysis that demonstrates the convergence of CompSkipDSGD under established assumptions. Overall, CompSkipDSGD could be useful for reducing the communication costs of distributed deep learning, enabling the use of large-scale datasets and models in complex environments.
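The abstract describes the mechanism only at a high level, so the following is a minimal single-process sketch of the general idea rather than the paper's algorithm: top-k sparsification stands in for the unspecified compression operator, a simple gradient-norm threshold stands in for the skipping rule, and unsent mass is accumulated locally so it can be "reserved" for future iterations. All function names, constants, and the synthetic least-squares task are illustrative assumptions.

```python
# Minimal single-process simulation of compressed, skip-enabled distributed SGD.
# Assumptions (not taken from the paper): top-k sparsification as the compressor,
# a gradient-norm threshold as the skipping rule, and local accumulation of
# anything that was not sent so it can be transmitted in a later iteration.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem, with data split across workers.
NUM_WORKERS, DIM, SAMPLES = 4, 50, 200
A = rng.normal(size=(NUM_WORKERS, SAMPLES, DIM))
x_true = rng.normal(size=DIM)
b = A @ x_true + 0.01 * rng.normal(size=(NUM_WORKERS, SAMPLES))

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def local_gradient(w, worker, batch=32):
    """Stochastic gradient of 0.5 * ||A_i w - b_i||^2 on a random mini-batch."""
    rows = rng.choice(SAMPLES, size=batch, replace=False)
    Ai, bi = A[worker][rows], b[worker][rows]
    return Ai.T @ (Ai @ w - bi) / batch

w = np.zeros(DIM)                        # model held by the server
residual = np.zeros((NUM_WORKERS, DIM))  # unsent ("reserved") updates per worker
LR, K, SKIP_TOL = 0.01, 5, 0.5

for it in range(500):
    received = []
    for i in range(NUM_WORKERS):
        g = local_gradient(w, i) + residual[i]  # fold in reserved updates
        msg = top_k(g, K)                       # compressed message
        if np.linalg.norm(msg) < SKIP_TOL:      # skip: keep everything local
            residual[i] = g
            continue
        residual[i] = g - msg                   # reserve the compression error
        received.append(msg)
    if received:                                # if every worker skipped, the server skips too
        w -= LR * np.mean(received, axis=0)

print("final error:", np.linalg.norm(w - x_true))
```

Accumulating the unsent remainder (error feedback) is a common way to keep compression and skipping from biasing the updates; whether CompSkipDSGD uses this exact mechanism is not stated in the abstract.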
Similar resources
Communication-Efficient Distributed Stochastic Gradient Descent with Butterfly Mixing
Stochastic gradient descent is a widely used method to find locally-optimal models in machine learning and data mining. However, it is naturally a sequential algorithm, and parallelization involves severe compromises because the cost of synchronizing across a cluster is much larger than the time required to compute an optimal-sized gradient step. Here we explore butterfly mixing, where gradient...
Quantized Stochastic Gradient Descent: Communication versus Convergence
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be very la...
High Throughput Synchronous Distributed Stochastic Gradient Descent
We introduce a new, high-throughput, synchronous, distributed, data-parallel, stochastic-gradient-descent learning algorithm. This algorithm uses amortized inference in a compute-cluster-specific, deep, generative, dynamical model to perform joint posterior predictive inference of the mini-batch gradient computation times of all worker nodes in a parallel computing cluster. We show that a synchro...
Variance Reduction for Distributed Stochastic Gradient Descent
Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical dis...
Sparse Communication for Distributed Gradient Descent
We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and a...
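The drop rule described in the last excerpt above is straightforward to illustrate. The sketch below is a generic reconstruction under that assumption (zero the 99% smallest entries by absolute value, then exchange index/value pairs); the helper names are hypothetical and this is not the authors' released code.

```python
# Illustrative sketch of the sparsification rule quoted above: keep only the
# largest 1% of gradient entries by magnitude and exchange a sparse representation.
import numpy as np

def sparsify_99(grad):
    """Return (indices, values) of the top 1% of |grad|; the rest is treated as zero."""
    flat = grad.ravel()
    keep = max(1, int(0.01 * flat.size))
    idx = np.argpartition(np.abs(flat), -keep)[-keep:]
    return idx, flat[idx]

def densify(idx, vals, shape):
    """Rebuild a dense gradient from the exchanged sparse representation."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

rng = np.random.default_rng(0)
g = rng.normal(size=(256, 128))
idx, vals = sparsify_99(g)            # only ~1% of entries (plus indices) are sent
g_hat = densify(idx, vals, g.shape)
print(idx.size, "of", g.size, "entries exchanged")
```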
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3315331